Regression using m-estimators and polynomial kernel support vector machines and principal component regression

ABSTRACT

Embodiments of the invention relate to sketching for M-estimators for performing regression. One embodiment includes providing one or more sets of input data. A matrix A and a vector b are generated using the input data. A processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S. A vector x that minimizes a normalized measure function is determined based on the matrix A and the vector b. A relationship between the input data is determined based on the vector x.

This invention was made with Government support under FA8750-12-C-0323 P00003 awarded by Defense Advanced Research Projects Agency (DARPA). The Government has certain rights in this invention.

BACKGROUND

Embodiments of the invention relate to regression processing and, in particular, sketching for M-estimators for performing regression and providing an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear kernel.

Linear regression is widely used in biological, behavioral and social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines. Regression processing sometimes involves the tool of sketching, which in generality is a descendent of random projection methods. Sketching has emerged as a powerful dimensionality reduction technique for accelerating statistical learning techniques such as l_(p)-regression, low rank approximation, principal component analysis (PCA), and support vector machines. For natural settings of parameters, the sketching technique has led to asymptotically optimal algorithms for a number of applications, often providing a speedup of order of magnitude over slower exact algorithms given by, e.g., the singular value decomposition (SVD). Oblivious subspace embedding (OSE) is essentially a data-independent random transform which is an approximate isometry over a subspace. It is crucial for any reasonable use of an OSE that applying it to a vector or a collection of vectors (a matrix) can be done in time that is faster than that of the nave algorithm, or at least faster than the intended downstream use.

Conventional OSEs are for subspaces that have a representation as the column space of an explicitly provided matrix, or close variants of it, and they admit a fast multiplication given an explicit representation of a vector or matrix (either dense or sparse). This is quite unsatisfactory in many statistical learning settings. In many cases the input may be described by a moderately sized n-by-d sample-by-feature matrix A, but the actual learning is done in a much higher (possibly infinite) dimensional space, by mapping each row of A to the higher dimensional feature space.

BRIEF SUMMARY

Embodiments of the invention relate to sketching for M-estimators for performing regression. One embodiment includes a method that includes providing one or more sets of input data. A matrix A and a vector b are generated using the input data. A processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S. A vector x that minimizes a normalized measure function is determined based on the matrix A and the vector b. A relationship between the input data is determined based on the vector x.

These and other features, aspects and advantages of the present invention will become understood with reference to the following description, appended claims and accompanying figures.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example of a network environment for sketching for M-estimators for performing regression and OSE embedding, according to an embodiment of the present invention;

FIG. 2 is a block diagram illustrating an example of a server including a system utilizing sketching for M-estimators for performing regression and OSE embedding, according to an embodiment of the present invention, as shown in FIG. 1;

FIG. 3 illustrates a block diagram of an example computing device for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention;

FIG. 4 is a block diagram showing a process for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention; and

FIG. 5 is a block diagram showing a process for sketching using an OSE for embedding a space induced by a nonlinear kernel, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products, according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

FIG. 1 illustrates an example of the basic components of an information technology system 10 utilizing system 100, according to an embodiment of the present invention. The information technology system 10 includes a server 11 and remote devices 15 and 17-20 that may utilize the system 100 of the present invention. In one embodiment, the server 11 implements the system 100 of the present invention.

Each of the remote devices 15 and 17-20 has applications and can have a local database 16. Server 11 contains applications, and is connected to a database 12 that can be accessed by remote device 15 and 17-20 via connections 14(A-F), respectively, over a network 13. The server 11 executes software for a computer network and controls access to itself and database 12. The remote device 15 and 17-20 may access the database 12 over the network 13, such as but not limited to: the Internet, a local area network (LAN), a wide area network (WAN), via a telephone line using a modem (POTS), Bluetooth, WiFi, WiMAX, cellular, optical, satellite, RF, Ethernet, magnetic induction, coax, RS-485, the like or other like networks. The server 11 may also be connected to the local area network (LAN) within an organization.

The remote device 15 and 17-20 may each be located at remote sites. Remote device 15 and 17-20 include but are not limited to, PCs, workstations, laptops, handheld computers, pocket PCs, PDAs, pagers, WAP devices, non-WAP devices, cell phones, palm devices, printing devices, and the like. Included with each remote device 15 and 17-20 is an ability to request relevant material from a large collection of documents via search queries to the server 11. Thus, when a user at one of the remote devices 15 and 17-20 desires to access the system 100 and the database 12 at the server 11, the remote device 15 and 17-20 communicates over the network 13, to access the system 100, the server 11 and database 12.

Third party computer systems 21 and databases 22 can be accessed by the server 11 in order to provide access to additional collections of documents and/or search indexes. Data that is obtained from third party computer systems 21 and database 22 can be stored on server 11 and database 12 in order to provide later access to the user on remote devices 15 and 17-20. It is also contemplated that for certain types of data, the remote devices 15 and 17-20 can access the third party computer systems 21 and database 22 directly using the network 13.

In one or more embodiments, the system 100 utilizes a process for sketching for M-estimators for performing regression and/or OSE embedding. One or more embodiments provide processing for the M-estimators of min_(x)∥Ax−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by a cost or measure function G:

⁺, with ∥y∥_(G)≡Σ_(i) G(y_(i)), where A is a matrix (n×d), b, x (e.g., explanatory, unknown) and y (e.g., dependent, measurements, samples, etc.) are vectors, and i, n and i are positive integers. M-estimators generalize l_(p) (e.g., least squares, robust, etc.) regression, for which G(x)=|x|^(p), where p ε[1, ∞]. In one embodiment, M-sketch is introduced, which includes a randomized construction of a sketching matrix S so that SAx may be used to estimate ∥Ax∥_(G), for all xε

^(a). This construction yields a single sketching process for all M-estimators requiring O(nnz(A)+poly(d log n)) time (where nnz is non-zero entries of the matrix A) and making one pass over the matrix in order to find a solution whose residual error is within a constant factor of optimal (e.g., as selected), with constant probability. In one example embodiment, an M-estimators using a Huber measure can be computed up to relative error ε in O(nnz(A)log n+poly(d(log n)/ε)) time. One or more embodiments provide improvements for subspace embeddings for l₁ (robust regression), and for dimensionality reduction from l₁ into l₁.

In one embodiment, system 100 provides fast oblivious subspace embeddings for spaces induced by a non-linear kernel without explicitly mapping the data to the high-dimensional space. In one example, OSEs for mappings induced by the polynomial kernel are implemented. The polynomial kernel is an important kernel often used in, e.g., natural language processing. In other examples, popular kernels such as the Gaussian kernel may sometimes be approximated by the polynomial kernel using a Taylor series. In one embodiment, the implemented OSEs may be used to obtain faster algorithms for the polynomial kernel. Namely, faster processes are obtained for kernel support vector machines (VMs), low rank approximation and approximate kernel principal component analysis (PCA), and approximate principal component regression (PCR).

One or more embodiments may employ the system 100 for regression applications involving, biological, behavioral and social sciences, trend estimations, epidemiology, finance, economics, environmental science, astronomy, bioinformics, medicine, engineering, control theory, image analysis, computer science, Internet analysis, data analysis, etc.

Illustrated in FIG. 2 is a block diagram demonstrating an example of server 11, as shown in FIG. 1, utilizing the system 100 according to an embodiment of the present invention. The server 11 includes, but is not limited to, PCs, workstations, laptops, PDAs, palm devices, and the like. The processing components of the third party computer systems are similar to that of the description for the server 11 (FIG. 2).

Generally, in terms of hardware architecture, as shown in FIG. 2, the server 11 includes a processor 41, a computer readable medium such as memory 42, and one or more input and/or output (I/O) devices (or peripherals) that are communicatively coupled via a local interface 43. The local interface 43 can be, for example but not limited to, one or more buses or other wired or wireless connections, as is known in the art. The local interface 43 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers to enable communications. Further, the local interface 43 may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.

The processor 41 is a hardware device for executing software that can be stored in memory 42. The processor 41 can be virtually any custom made or commercially available processor, a central processing unit (CPU), data signal processor (DSP) or an auxiliary processor among several processors associated with the server 11, and a semiconductor based microprocessor (in the form of a microchip) or a microprocessor.

The memory 42 can include any one or combination of volatile memory elements (e.g., random access memory (RAM), such as dynamic random access memory (DRAM), static random access memory (SRAM), etc.) and nonvolatile memory elements (e.g., read only memory (ROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 42 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 42 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 41.

The software in memory 42 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example illustrated in FIG. 2, the software in the memory 42 includes a suitable operating system (O/S) 51 and the search system 100 of the present invention. The system 100 comprises functional components and process blocks described further below.

The operating system 51 essentially controls the execution of other computer programs, such as the system 100, and provides scheduling, input/output control, file and data management, memory management, and communication control and related services. However, the system 100 of the present invention is applicable on all other commercially available operating systems.

The system 100 may comprise a source program, executable program (object code), script, or any other entity comprising a set of computer program instructions to be performed. When the system 100 is a source program, then the program is usually translated via a compiler, assembler, interpreter, or the like, which may or may not be included within the memory 42, so as to operate properly in connection with the O/S 51. Furthermore, the system 100 can be written as (a) an object oriented programming language, which has classes of data and methods, or (b) a procedure programming language, which has routines, subroutines, and/or functions. The computer program instructions may execute entirely on server 11, partly on the server 11, as a stand-alone software package, partly on server 11 and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The I/O devices may include input devices, for example but not limited to, a mouse 44, keyboard 45, scanner (not shown), microphone (not shown), etc. Furthermore, the I/O devices may also include output devices, for example but not limited to, a printer (not shown), display 46, etc. Finally, the I/O devices may further include devices that communicate both inputs and outputs, for instance but not limited to, a NIC or modulator/demodulator 47 (for accessing remote devices, other files, devices, systems, or a network), a radio frequency (RF) or other transceiver (not shown), a telephonic interface (not shown), a bridge (not shown), a router (not shown), etc.

If the server 11 is a PC, workstation, intelligent device or the like, the software in the memory 42 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is a set of essential software routines that initialize and test hardware at startup, starts the O/S 51, and supports the transfer of data among the hardware devices. The BIOS is stored in some type of read-only-memory, such as ROM, PROM, EPROM, EEPROM or the like, so that the BIOS can be executed when the server 11 is activated.

When the server 11 is in operation, the processor 41 is configured to execute software stored within the memory 42, to communicate data to and from the memory 42, and generally to control operations of the server 11 pursuant to the software. The system 100 and the O/S 51 are read, in whole or in part, by the processor 41, perhaps buffered within the processor 41, and then executed.

FIG. 3 shows an implementation of system 100 as a computing device 300 for sketching for M-estimators for performing regression and providing an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear kernel. In one embodiment, server 300 comprises a storage module 310, a matrix processing module 320, a vector processing module 330, a sketching module 340 and a regression process 350. In one embodiment, computing device 300 may extend and improve on O(nnz(A)) sketching results using the sketching module 340, matrix processing module 320, vector processing module 330 and the regression process 350 by giving a single sketching matrix construction, which is referred to herein as an M-sketch, that encompasses not only the l₂ (linear regression) and l₁ (robust regression) cases, but also the general category of statistical M-estimators, which seek to combine the insensitivity to outliers of l₁ regression with the low variance of l₂ regression. In one embodiment, the estimators are specified by a cost or measure function G:

⁺, where G(x)=G(−x), G(0)=0, and G is non-decreasing in |x|. The result is a new “norm” ∥y∥_(G)≡Σ_(iε[n]) G(y_(i)). It should be noted that in general, the functions ∥ ∥_(G) are not true norms, but are sometimes referred to as such herein. In one embodiment, the M-sketch is constructed independently of G, and the same sketch may be used for all G. Thus, the input from the storage module 310 may be sketched by the sketching module 340, and a particular choice of a penalty function is determined later by the regression process 350.

In one example, the Huber norm is specified by a parameter τ>0, and its measure function H is given by

${H(a)} \equiv \left( \begin{matrix} {a^{2}\text{/}2\tau} & {{{if}{a}} \leq \tau} \\ {{a} - {\tau \text{/}2}} & {{otherwise},} \end{matrix} \right.$

combining an l₂-like measure for small x with an l₁-like measure for large x. In one embodiment, the sketching module 340 provides an M-sketching scheme that comprises a sketching matrix Sε

^(t×n) and the vector processing module provides a scaling vector wε

^(t). In one example, for zε

^(t), let ∥z∥_(G,w) denote Σ_(iε[t]) w_(i)G(z_(i)). In one embodiment, the M-sketch SA of the matrix A has the property that ∥SAx∥_(G,w)≈∥Ax∥_(G), under conditions that are enough to show that the G-regression problem min_(xε)

_(d) ∥Ax−b∥_(G) may be reduced in O(nnz(A)) time to the solution of a similar kind of problem min_(xε)

_(d) ∥S(Ax−b)∥_(G,w) (with a side constraint), yielding an O(log n) approximation with constant probability over the random choice of S. In one example embodiment, the scaling vector w is chosen deterministically by the vector processing module 330.

In one example embodiment, the regression process 350 changes the measure on SAx from ∥SAx∥_(G,w) to a “clipped” version ∥ ∥_(Gc,w), in which only some of the larger coordinates of the sketch vector SAx are used when evaluating the norm. This variation allows a sketching process implemented by the sketching module 340 that provides a constant-factor approximation to min_(x)∥Ax−b∥_(G) with constant probability.

In one example, additional conditions on the function G beyond symmetry and monotonicity may be applied, such as that it grows no faster than quadratically in x, and no slower than linearly. Subquadratic growth is appropriate for robust regression, to reduce the effect of large values in the residual Ax−b, relative to their effect in least-squares. Almost all proposed M-estimators satisfy these conditions. The latter linear lower bound on the growth of G holds for all convex G, and many M-estimators have convex G. Moreover, the convexity of G implies the convexity of ∥ ∥_(G), which is needed for computing a solution to the minimization problem in polynomial time. Convexity also implies significant properties for the statistical interpretation of the results, such as consistency and asymptotic normality. In one embodiment, G is not required to be convex for sketching results from the sketching module 340. Some M-estimators are not convex. Therefore, in one embodiment the regression process 350 reduces a large non-convex problem, min_(x)∥Ax−b∥_(G), to a smaller non-convex problem min_(x)∥S(Ax−b)∥_(G,w) of a similar kind. The linear growth lower bound implies that is it unable to apply sketching by the sketching module 340 to some proposed M-estimators; for example, the “Tukey” estimator, whose G function is constant for large argument values, is not included in results. However, in one or more embodiments, the regression process 350 may get close, in the sense that at the cost of more computation, G functions that grow arbitrarily slowly are handled by the regression process 350.

In one or more embodiments, the regression process 350 obtains optimal O(nnz(A)+poly(d log n)) time approximation processes for the M-estimators, and the sketch non-trivially reduces the dimension of any of these estimators other than the l_(p)-norms (which are a special case of M-estimators). With conventional regression algorithms, for the L₁-L₂ estimator in which G(x)=2(√{square root over (1+x²/2)}−1), the Fair estimator in which

${{G(x)} = {c^{2}\left\lbrack {\frac{x}{c} - {\log \left( {1 + \frac{x}{c}} \right)}} \right\rbrack}},$

or the Huber estimator, no dimensionality reduction for the regression problem is known.

In one embodiment, for general M-estimators, the regression process 350 provides constant-factor approximations for O(nnz(A)+poly(d log n))-time one-pass implementations. In one embodiment, for the case of the Huber norm, a sampling scheme based on a combination of Huber's l₁ and l₂ properties may be used to obtain an implementation yielding an ε-approximation with respect to the Huber norm of the residual. The Huber norm is of particular interest, because it is popular and recommended for many situations as it is robust in a certain sense, and because it has the useful computational and statistical properties implied by the convexity and smoothness of its defining function. The smoothness makes it differentiable at all points, which can lead to computational savings over l₁, while enjoying the same robustness properties with respect to outliers. Additionally, l₁ for instance, treats small residuals “as seriously” as large residuals, whereas it is often more appropriate to have robust treatment of large residuals and Gaussian treatment of small residuals.

In one embodiment, the M-sketch yields an improved l₁ subspace embedding than conventional regression processes. For G(x)=|x|, the M-sketch value may be denoted as Σ_(iε[t]) w_(i)|z_(i)|=∥D(w)z∥₁, where D(w) is the diagonal matrix W with W_(ii)=w_(i); for G(x)=|x|, ∥Sy∥_(G,w)=∥D(w)Sy∥_(G)=∥D(w)Sy∥₁. In one embodiment, with constant probability, the M-sketch SA has Ω(1)∥Ax∥₁≦∥D(w)SAx∥₁≦O(d)∥Ax∥₁, for all xε

^(d), and hence has distortion O(d).

One or more embodiments using the regression process 350 have results involving a sketching dimension that is polynomial in d. In one example embodiment, with a sketching dimension that is poly(d/ε)^(1/ε), M-sketches have the dilation bound ∥D(w)Sy∥₁≦(1+ε)∥y∥₁ for a single vector y, with constant probability. Together with the contraction bound ∥D(w)Sy∥₁≧(1−ε)∥y∥₁, this is an improvement over a construction due to Indyk, for which the probability of success is ε. In one embodiment, the results for l₁ are built on bounds on the contraction and dilation of D(w)S, based on theorems that state that ∥D(w)SAx∥₁ will not be too much smaller than ∥Ax∥₁, with high probability, and cannot be too much larger, with constant probability.

In one embodiment, the sketching module 340 provides the M-sketch that is novel for l₁, because unlike many conventional sketching schemes for l₁, the sketching module 340 does not use continuous random variables such as Cauchy or exponential. It should be noted that M-sketches do involve random variables that, like Cauchy, have an undefined expectation. It should also be noted that although there are sketches in the streaming literature for estimating l₁ values with sign (±1) random variables, the conventional techniques use operations such as the median to achieve high probability results (which is needed in one direction for regression) and which do not make them amenable to optimization.

In one embodiment, the sketching module 340 interacts with the matrix processing module 320 and the vector processing module 330 for implementing a variant construction, comprising a sequence of sketching matrices S₀, S₁, . . . S_(h) _(max) , for a parameter h_(max), each comprising a block of rows of the sketching matrix:

$S \equiv {\begin{bmatrix} S_{0} \\ S_{1} \\ S_{2} \\ \vdots \\ S_{h_{{ma}\; x}} \end{bmatrix}.}$

When applied to vector yε

^(n), each S_(h) ignores all but a subset L_(h) of n/b^(h) entries of y, where b>1 is a parameter, and where those entries are chosen uniformly at random (i.e., S_(h) may be factored as S′_(h)S″_(h), where S″_(h)ε

^(n/b) ^(h) ^(×n) samples row i of A by having column i with a single 1 entry, and the rest zero, and S″_(h) has only n/b^(h) nonzero entries).

In one embodiment, each S_(h) implements a particular sketching scheme from the sketching module 340 referred to as COUNT-SKETCH on its random subset. COUNT-SKETCH splits the coordinates of y into groups (“buckets”) at random, and adds together each group after multiplying each coordinate by a random ±1; each such sum constitutes a coordinate of S_(h)y. COUNT-SKETCH has been shown to be a good subspace embedding for l₂, implying here that the matrix S₀, which applies to all the coordinates of y=Ax, has the property that ∥S₀Ax∥₂ is a good estimator for ∥Ax∥₂ for all xε

^(d); in particular, each coordinate of S₀y is the magnitude of the l₂ norm of the coordinates in the contributing group.

In one embodiment, the sketching module 340 provides construction, based on l₂ embeddings, that are suitable for l₁, with ∥D(w)SAx∥₁ provided as an estimate of ∥Ax∥₁, where M-sketch is effective for that norm. In one example, an intuition is derived from considering the matrix S_(h) _(max) for the smallest random subset L_(h) _(max) of y=Ax to be sketched; one can think of S_(h) _(max) y as one coordinate of y=Ax, chosen uniformly at random and sign-flipped. The expectation of ∥S_(h) _(max) y∥₁ is Σ_(iε[n])∥y_(i)∥₂/n=∥y∥₁/n; with appropriate scaling from D(w), that smallest random subset yields an estimate of ∥y∥₁=∥Ax∥₁ (this scaling is where the values w are needed). The variance of this estimate is too high to be useful, especially when the norm of y is concentrated in one coordinate, say y₁=1, and all other coordinates zero. For such a y, however, ∥y∥₂=∥y∥₁, so the base level estimator ∥S₀y∥₂ provides a good estimate. On the other hand, when y is the vector with all coordinates 1/n, the variance of ∥S_(log) _(b) _(n)y∥₁ is zero, while ∥S₀y∥₂=∥y∥₂ is quite inaccurate as an estimator of ∥y∥₁. So in these extreme cases, the extreme ends of the M-sketch are effective. The intermediate matrices S_(h) of the M-sketch serve to help with less extreme cases of y-vectors.

In one embodiment, another intuition about the usefulness of the M-sketch for robust estimation is derived from considering data A and b coming from a dataset (from the storage module 310) with inliers and outliers: that is, there is x_(o) so that for most iε[n], A_(i:)x_(o)−b_(i) is very small, where A_(i:) denotes the i′th row of A, but for a few iε[n], A_(i:)x_(o)−b_(i) is very large, so that x_(o) is not a minimizer of ∥Ax−b∥₁=Σ_(i)|A_(i:)x−b_(i)|. In one embodiment, the sketching module 340, in constructing M-sketch, may determine that it may be likely that for several of the S_(h), the random subset L_(h) includes none of the coordinates in the small set corresponding to outlier residuals, so that ∥D(w)S(Ax−b)∥₁=Σ_(h)∥D(w_(h))S_(h)(Ax−b)∥₁ has many contributing terms for which the outliers have no effect. In one embodiment, the vector processing module 330 builds the scaling vector w so that the weighted contributions from each S_(h) are roughly equal; this reduces the effect of the outliers, and shift the solution toward x_(o).

In one embodiment, the regression process 350, for the Huber estimator, involves importance sampling of the (A_(i:), b_(i)), where a sampling matrix S′ is obtained such that ∥S′(Ax−b)∥_(H,w) is a useful approximation to ∥Ax−b∥_(H). The sampling probabilities are based on a combination of the l₁ leverage score vector uε

^(n), and the l₂ leverage score vector u′ε

^(n). In on embodiment, the regression process 350 uses the l₁ vector u from the vector processing module 330 to obtain good sampling probabilities for l₁ regression, and similarly for u′ and l₂. Since the Huber measure has a mixed l₁/l₂ character, a combination of l₁ and l₂ scores are used to obtain good sampling probabilities for Huber. In one embodiment, the Huber norm of a vector is bounded in terms of n, τ, and its l₁ and l₂ norms, and leads to a recursive sampling process.

In one or more embodiments, the general structure of the arguments is to show, as discussed above, that for a fixed xε

^(d) there are bounds on:

-   -   contraction, so with high probability, ∥SAx∥_(G,w) is not too         much smaller than ∥Ax∥_(G); and     -   dilation, so with constant probability, ∥SAx∥_(G,w) is not too         much bigger than ∥Ax∥_(G). The asymmetry in probabilities means         that some results are out of reach, but still allows         approximation processing by the regression module 350 for         min_(x)∥Ax−b∥_(G). (the distinction is blurred between applying         S to A for vectors xε         ^(d), and to [A b] for vectors [x−1]). If the optimum x_(OPT)         for the original problem has ∥S(Ax_(OPT) b)∥_(G) that is not too         large, then it will be be a not-too-large solution for the         sketched problem min_(x)∥S(Ax−b)∥_(G,w). If contraction bounds         hold with high probability for a fixed vector Ax, then a         standard argument shows that they hold for all x; thus, there         will be no x that gives a good, small ∥S(Ax−b∥_(G,w) and bad,         large ∥Ax−b∥_(G).

In one embodiment, the contraction and dilation bounds are shown on a fixed vector yε

^(n) by the vector module 330 splitting up the coordinates of y into groups (“weight classes”) of roughly equal magnitude. In one embodiment, a weight class W is then analyzed by the regression process 350 with respect to its cardinality: there will be some random subset (“level”) L_(ĥ) for which |W∩L_(ĥ)| is small relative to the number of rows of S_(ĥ) (each row of S_(ĥ) corresponds to a bucket, as an implementation of COUNT-SKETCH), and therefore the members of W are spread out from each other, in separate buckets. This implies that each member of W makes its own independent contribution to ∥Sy∥_(G,w), and therefore that ∥Sy∥_(G,w) will not be too small. Also, the level L_(ĥ) is chosen by the regression process 350 such that the expected number of entries of the weight class is large enough that the random variable |W∩L_(ĥ)| is concentrated around its mean, and hence this contribution from W is well-behaved.

In one embodiment, to show that ∥Sy∥_(G,w) will not be too big, it may be shown that W will not contribute too much to levels other than the “Goldilocks” level L_(ĥ): for h<ĥ, for which |L_(h)∩W| is expected to be large, the fact that members of W∩L_(h) will be crowded together in a single bucket implies they will cancel each other out, roughly speaking; or more precisely, the fact that the COUNT-SKETCH buckets have an expectation that is the l₂ norm of the bucket entries implies that if a bucket contains a large number of entries from one weight class, those entries will make a lower contribution to the estimate ∥Sy∥_(G,w) than they did for L_(ĥ). In one embodiment, for h a bit bigger than ĥ, W∩L_(h) will likely be empty, and W will make no contribution to ∥S_(h)y∥.

One or more embodiments may use the matrix processing module 320, the vector processing module 330 and the sketching module 340 to provide subspace embedding for polynomial kernel maps, which may be further used by the regression process 350. Consider the polynomial kernel of degree q (a positive real number) defined as k(x, y)=(

x, y

+c)^(q) for some constant c≧0. Without loss of generality it may be assumed henceforth that c=0 since the more general case can be reduced to it by adding a coordinate of value √{square root over (c)} to all of the data points. In one example, let φ(x) denote the function that maps a d-dimensional vector x to the d^(q)-dimensional vector formed by taking the product of all subsets of q coordinates of x, and let φ(A) denote the application of φ to the rows of A by the vector processing module 330 and the matrix processing module 320. In one example, φ is the map that corresponds to the polynomial kernel, that is k(x, y)=

φ(x), φ(y)

, so learning with the polynomial kernel corresponds to using φ(A) instead of A.

In one embodiment, the sketching module 340 designs a distribution over d^(q)×O(3^(q) n²/ε²) sketching matrices S so that the mapping φ(A)·S may be computed in O((A)q)+(3^(q) n/ε) time, where (A) denotes the number of non-zero entries of A. Further, with constant probability arbitrarily close to 1, for all n-dimensional vectors z, ∥z·φ(A)·S∥₂=(1±ε)∥z·φ(A)∥₂, that is, the entire row-space of φ(A) is approximately preserved.

In one embodiment, the sketching module 340 may implement different applications, such as SVM, approximate kernel PCA, PCR, etc. In one example, an n×k matrix V with orthonormal columns spans a rank-k (1+)-approximation of an n×d matrix A if ∥A−VV^(T)A∥_(F)≦(1+ε)∥A−A_(k)∥_(F). In one embodiment, the results from the sketching module 340 for a constant degree q:

-   -   For polynomial kernel SVM, when d>>n, existing algorithms would         either take O(dn²) time to form the n×n Gram matrix φ(A)         φ(A)^(T), or at least d^(q) time to compute φ(A)S for a         sketching matrix S. Once these are computed, the SVM can be         solved in (n/ε) time using standard methods. In contrast, in one         embodiment the process implemented by the sketching module 340         takes O((A))+(n/ε) time, which improves the leading order term         from O(dn²) to O((A)), the latter always being at most O(nd) but         can be significantly smaller for sparse matrices.     -   In O((A))+n·(k/) time the matric processing module 320 computes         an n×k matrix V with orthonormal columns for which         ∥φ(A)−VV^(T)φ(A)∥_(F)≦(1+)∥φ(A)−[φ(A)]_(k)∥_(F), where         [φ(A)]_(k) denotes the best rank-k approximation to φ(A). The         k-dimensional subspace V of         ^(n) can be thought of as an approximation to the top k left         singular vectors of φ(A).     -   Once the sketching module 340 determines the subspace V, a low         rank approximation to φ(A) is obtained. To do this, in one         embodiment the sketching module 340 samples a subset R of (k/)         rows of φ(A) according to the leverage scores of V, obtaining a         (1+) rank-k approximation to φ(A) given by V·U·R, where U is a         k×(k/) matrix. Here the row space of R contains a good         approximation to the span of the top k principal components of         φ(A). In one example, it is desired to represent this matrix         implicitly, since otherwise R would have d^(q) columns. This is         possible by storing the sampled rows A_(i) rather than the         sampled rows φ(A_(i)). In one embodiment, the sketching module         340 obtains three matrices V, U, and R from the matric         processing module 320, for which         ∥φ(A)−V·U·φ(R)∥_(F)≦(1+)∥φ(A)−[φ(A)]_(k)∥_(F). This         representation is useful, since given a point yε         ^(d), the sketching module 340 can compute φ(R)·φ(y) quickly         using the kernel trick. The total time to compute the low rank         approximation is O((A))+(n+d)·(k/). This is considerably faster         than standard kernel PCA which would need O(n² d) time to         compute the Gram matrix φ(A)·φ(A)^(T).     -   The subspace V can be used to regularize and speed up various         learning processes with the polynomial kernel. In one example,         using the regression process 350 the subspace V may be used to         solve regression problems of the form min_(x)∥Vx−b∥₂, an         approximate form of principal component regression. This can         serve as a form of regularization, which is required as the         problem min_(x)∥(A)x−b∥₂ is underdetermined. A popular         alternative form of regularization is to use kernel ridge         regression, which requires O(n²d) operations. As(A)≦nd, one or         more embodiments using the regression process 350 are faster.

For sketching polynomial kernels, the TensorSketch algorithm is well-known. TensorSketch combines the earlier CountSketch with a fast Fourier transform (FFT) in a clever way. What was not known about TensorSketch, which is crucial in order to use it as an OSE for applications with one or more embodiments, is whether it can be used to approximately preserve entire subspaces of points. Indeed, a fixed point vε

^(d) has the property that for the TensorSketch sketching matrix S, ∥φ(v)·S∥₂=(1±)∥φ(v)∥₂ with constant probability. To obtain a high probability bound using their results, a median of several independent sketches may be taken. Given a high probability bound (1−exp(−n)), one can use a net argument to show that the sketch is correct for all vectors v in an n-dimensional subspace of

_(d). The median operation results in a non-convex embedding, and it is not clear how to efficiently solve optimization problems in the sketch space with such an embedding. Moreover, the fact that n independent sketches are needed to achieve probability 1−exp(−n) means the running time will be at least n·(A), whereas one or more embodiments only seek an (A) time.

Conventional applications show that CountSketch can be used to provide a subspace embedding, that is, simultaneously for all vεV, ∥φ(v)·S∥₂=(1±)∥φ(v)∥₂. TensorSketch can be seen as a very restricted form of CountSketch, where the additional restrictions enable its fast running time on inputs which are tensor products. In particular, the hash functions in TensorSketch are only 3-wise independent. CountSketch still provides a subspace embedding if the entries are chosen from a 4-wise independent distribution.

It should be noted that previous work on sketching the polynomial kernel suffers from the drawback described above, that is, it provides no provable guarantees for preserving an entire subspace, which is needed, e.g., for low rank approximation. This is true even of the sketching methods for polynomial kernels that do not use TensorSketch, as it only provides tail bounds for preserving the norm of a fixed vector, and has the aforementioned problems of extending it to a subspace, i.e., boosting the probability of error to be enough to union bound over net vectors in a subspace would require increasing the running time by a factor equal to the dimension of the subspace.

An unusual aspect is that for a TensorSketch matrix S can compute φ(A)·S very efficiently. However, computing S·φ(A) is not known to be efficiently computable, and indeed, for degree-2 polynomial kernels this can be shown to be as hard as general rectangular matrix multiplication. In general, even writing down S·φ(A) would take a prohibitive d^(q) amount of time. Therefore, in one embodiment, the sketching module 340 is designed to only sketch on one side of φ(A).

The goal in random features maps is to construct randomized feature maps Ψ(•) so that the Euclidean inner product <ΨP(u),Ψ(v)> closely approximates the value of k (u, v) where k is the kernel; the mapping Ψ is dependent on the kernel. Theoretical analysis in this line of research has focused so far on showing that <Ψ(u),Ψ(v)> is indeed close to k (u, v). This is also the kind of approach that is used to analyze TensorSketch. The problem with this kind of analysis is that it is hard to relate it to downstream metrics like generalization error. In contrast, in one embodiment the sketching module 340 based on OSEs provides a computational framework for analyzing the mappings, to reason about their downstream use, and to utilize various tools from numerical linear algebra in conjunction with them.

In one example, the CountSketch data structure is described as follows. Let B be the dimension of CountSketch, which is the number of its hash buckets. When applied to d-dimensional vectors v, the data structure is specified by a 2-wise independent hash function h: [d]→[B] and a 2-wise independent sign function s: [d]→−1, +1. The value in the i-th bucket of the data structure, for i=1, 2, . . . , B, is given by Σ_(j|h(j)=i) s(j)v_(j). CountSketch can be represented as a B×d matrix in which the j-th column contains a single non-zero entry s(j) in the h(j)-th row.

The TensorSketch data structure is described as follows. Suppose a point vε

^(d) and so φ(v)ε

^(d) ^(q) are given. In one example, q 3-wise independent hash functions h₁, . . . , h_(g): [d]→[B], and q 2-wise independent sign functions s₁, . . . , s_(q): [d] {+1, 1} are chosen. The hash function H: [d^(q)]→[B] and sign function S: [d^(q)]→{+1, −1} are defined as follows:

H(i ₁ , . . . ,i _(g))=h ₁(i ₁)+h ₂(i ₂)+ . . . +h _(q)(i _(q))mod B,

and

S(i ₁ , . . . ,i _(q))=s ₁(i ₁)·s ₂(l ₁) . . . s _(q)(i _(g)).

It is well-known that if H is constructed this way, then it is 3-wise independent. Unlike a conventional technique, which implemented H as a 2-wise independent, in the following a stronger property of H is needed. To hash the point v, TensorSketch computes the polynomials

p _(l)(x)=Σ_(i=0) ^(B−1) x ^(i)Σ_(j|h) _(l) _((j)=i) v _(j) ·s _(l)  (j),

for l=1, 2, . . . , q. A calculation shows

Π_(l=1) ^(q)(x)mod(x ^(B)−1)=Σ_(i=0) ^(B−1) x ^(i)Σ_((j) ₁ _(, . . . ,j) _(q) _()|H(j) ₁ _(, . . . ,j) _(q) _()=i) v _(j) ₁ . . . v _(j) _(q) S(j ₁ , . . . ,j _(q))

that is, the product of the q polynomials can be viewed as a CountSketch data structure applied to the vector v

^(q) with hash function H and sign function S.

A key insight of Pagh is that this product of polynomials can be computed in O(qB log B) time using the Fast Fourier Transform. As it takes O(q(v)) time to form the q polynomials, the overall time to compute TensorSketch(v) is O(q((v)+B log B)). While an error analysis of TensorSketch is provided in the literature, it is only for a single point, and does not apply to properties of subspaces. In one or more embodiments, a new analysis of TensorSketch is provided for subspaces.

In one example embodiment, for OSE, let Π be the m×d^(q) matrix such that TensorSketch(v) is φ(v)Π⁷, where φ(v) is the row feature vector in d^(q). It will be shown that Π^(T) is an oblivious subspace embedding for subspaces in d^(q) for appropriate values of m. Notice that Π has exactly one non-zero entry per column. The index of the non-zero in the column (i₁, . . . , i_(g)) is H(i₁, . . . , i_(q))=Σ_(j=1) ^(q) h_(j)(i_(j))mod m. Let δ_(i,j) be the indicator random variable of whether Π_(i,j) is non-zero. The signs of the non-zero entry in column (i₁, . . . , i_(q)) is S(i₁, . . . , i_(q))=Π_(j=1) ^(q) s_(j)(i_(j)). The following lemmas show that Π can be used to approximate a matrix product and is a subspace embedding, according to one embodiment. Let A and B be matrices with d^(q) rows. For m=Θ(3^(q)/(²δ)):

Pr[∥A ^(T)Π^(T) ΠB−A ^(T) B∥ _(F) ²≦² ∥A∥ _(F) ² ∥B∥ _(F) ²]≧1−δ.

Proof. Let C=A^(T)Π^(T)ΠB. Then

C _(u,u′)=Σ_(t=1) ^(m)Σ_(i,jε[d]) _(q) S(i)S(j)δ_(t,i)δ_(t,j) A _(i,u) B _(j,u′)=Σ_(t=1) ^(m)Σ_(i≠jε[d]) _(q) S(i)S(j)δ_(t,i)δ_(t,j) A _(i,u) B _(j,u′)+(A ^(T) B)_(u,u′).

Thus, [C_(u,u′)]=(A^(T)B)_(u,u′). Next, the following is analyzed [((C−A^(T)B)_(u,u′))²]. This leads to:

((C−A ^(T) B)_(u,u′))²=Σ_(t) ₁ _(,t) ₂ ₌₁ ^(m)Σ_(i) ₁ _(≠j) ₁ _(,i) ₂ _(≠j) ₂ _(ε[d]) _(q) S(i ₁)S(i ₂)S(j ₁)S(j ₂)·δ_(t) ₁ _(,i) ₁ δ_(t) ₁ _(,j) ₁ δ_(t) ₂ _(,i) ₂ δ_(t) ₂ _(,j) ₂ .

A_(i) ₁ _(,u)A_(i) ₂ _(,u)B_(j) _(i) _(,u′)B_(j) ₂ _(,u′). For a term in the summation on the right hand side to contribute to [((S−A^(T)B)_(u,u′))²], it must be the case that [S(i₁)S(i₂)S(j₁)S(j₂)]=1. In other words, in each of the q coordinates, the 4 coordinates of i₁, i₂, j₁, j₂ must form matching pairs. In one example embodiment, let S₁ be the set of coordinates where i₁ and i₂ agrees. Note that j₁ and j₂ must also agree in all coordinates in S₁. Let S₂⊂[q]\S₁ be the coordinates among the remaining where i₁ and j₁ agrees. Finally, let S₃=[q]\(S₁∪S₂). All coordinates in S₃ of i₁ and j₂ must agree. It can be rewritten as i₁=(a, b, c), i₂=(a, e, f), j₁=(g, b, f), j₂=(g, e, c) where a, gε[d]^(s) ¹ , b, eε[d]^(s) ² , c, fε[d]^(S) ³ .

First it is shown that the contribution of the terms where i₁=i₂ or i₁=j₂ is bounded by

$\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m},$

where A_(u) is the uth column of A and B_(u), is the u′th column of B. Indeed, consider the case i₁=i₂. As observed before, j₁=j₂ must be used to get a non-zero contribution. Note that if t₁≠t₂, one always has δ_(t) ₁ _(,i) ₁ δ_(t) ₂ _(,i) ₂ =0 as H(i₁) cannot be equal to both t₁ and t₂. Thus, for fixed i₁=i₂, j₁=j₂,

$\left. {\left\lbrack {\sum_{t_{1},{t_{2} = 1}}^{m}{{S\left( i_{1} \right)}{S\left( i_{2} \right)}{S\left( j_{1} \right)}{{S\left( j_{2} \right)} \cdot \delta_{t_{1},i_{1}}} \delta_{t_{1},j_{1}} \delta_{t_{2},i_{2}} {\delta_{t_{2},j_{2}} \cdot}}}\quad \right.A_{i_{1},u} A_{i_{2},u} B_{i_{1},u^{\prime}} B_{j_{2},u^{\prime}}} \right\rbrack = {\left\lbrack {\sum_{t_{1} = 1}^{m}{\delta_{t_{1},i_{1}}^{2}\delta_{t_{1},j_{1}}^{2}A_{i_{1},u}^{2}B_{j_{1},u^{\prime}}^{2}}} \right\rbrack = {\frac{A_{i_{1},u}^{2}B_{j_{1},u^{\prime}}^{2}}{m}.}}$

Summing over all possible values of i₁, j₁, one embodiment obtains the desired bound of

$\frac{{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m}.$

The case i₁=j₂ is analogous.

In one embodiment, the contribution of the terms are computed by the sketching module 340, where i₁≠i₂,j₁,j₂ i.e., there are at least three (3) distinct numbers among i₁, i₂, j₁, j₂. Notice that

$\left\lbrack {\delta_{t_{1},i_{1}}\delta_{t_{1},j_{1}}\delta_{t_{2},i_{2}}\delta_{t_{2},j_{2}}} \right\rbrack \leq {\frac{1}{m^{3}}.}$

For fixed i₁, j₁, i₂, j₂, there are m² choices of t₁, t₂ so the total contribution to the expectation from terms with the same i₁, j₁, i₂, j₂ is bounded by

${m^{2} \cdot \frac{1}{m^{3}} \cdot {{A_{i_{1},u}A_{i_{2},u}B_{j_{1},u^{\prime}}B_{j_{2},u^{\prime}}}}} = {\frac{1}{m}{{{A_{i_{1},u}A_{i_{2},u}B_{j_{1},u^{\prime}}B_{j_{2},u^{\prime}}}}.}}$

Therefore,

${{\left\lbrack \left( \left( {C - {A^{T}B}} \right)_{u,u^{\prime}} \right)^{2} \right\rbrack \leq {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{1}{m}{\sum_{{partitionS}_{1},S_{2},S_{3}}{\sum_{a,{g \in {\lbrack d\rbrack}^{s_{1}}},b,{e \in {\lbrack d\rbrack}^{s_{2}}},c,{f \in {\lbrack d\rbrack}^{s_{3}}}}{{A_{{({a,b,c})},u}B_{{({g,b,f})},u^{\prime}}A_{{({a,e,f})},u}B_{{({g,e,c})},u^{\prime}}}}}}}} \leq {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}}{m}{\sum_{a,b,c,g,e,f}{{A_{{({a,b,c})},u}B_{{({g,b,f})},u^{\prime}}A_{{({a,e,f})},u}B_{{({g,e,c})},u^{\prime}}}}}}} \leq {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}}{m}{\sum_{g,e,f}{\left( {\sum_{a,b,c}A_{{({a,b,c})},u}^{2}} \right)^{1/2}\left( {\sum_{a,b,c}{B_{{({g,b,f})},u^{\prime}}^{2}A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}}} = {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}{{A_{u}}}}{m}{\sum_{g,e,f}{\left( {\sum_{b}B_{{({g,b,f})},u^{\prime}}^{2}} \right)^{1/2}\left( {\sum_{a,c}{A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}}},$

where the second inequality follows from the fact that there are 3^(q) partitions of [q] into 3 sets. The other inequalities are from Cauchy-Schwarz. Continuing to bound this expression,

${{\leq {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}{{A_{u}}}}{m}{\sum_{e}{\left( {\sum_{b,g,f}B_{{({g,b,f})},u^{\prime}}^{2}} \right)^{1/2}\left( {\sum_{a,c,g,f}{A_{{({a,e,f})},u}^{2}B_{{({g,e,c})},u^{\prime}}^{2}}} \right)^{1/2}}}}}} = {{{\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}{{{A_{u}}} \cdot {{B_{u^{\prime}}}}}}{m}{\sum_{e}{\left( {\sum_{a,f}A_{{({a,e,f})},u}^{2}} \right)^{1/2}\left( {\sum_{g,c}B_{{({g,e,c})},u^{\prime}}^{2}} \right)}}}} \leq {\frac{2{{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m} + {\frac{3^{q}{{{A_{u}}} \cdot {{B_{u^{\prime}}}}}}{m}\left( {\sum_{a,e,f}A_{{({a,e,f})},u}^{2}} \right)^{1/2}\left( {\sum_{g,e,c}B_{{({g,e,c})},u^{\prime}}^{2}} \right)^{1/2}}}} = \frac{\left( {2 + 3^{q}} \right){{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m}}},$

where the inequalities are applications of the Cauchy-Schwarz inequality.

Combining the above bounds, one has

$\left\lbrack \left( \left( {C\; - {A^{T}B}} \right)_{u,u^{\prime}} \right)^{2} \right\rbrack \leq {\frac{\left( {2 + 3^{q}} \right){{A_{u}}}^{2}{{B_{u^{\prime}}}}^{2}}{m}.}$

For m≧(2+3^(q))/(²δ), by the Markov inequality, ∥a^(T)Π^(T)ΠB−A^(T)B∥_(F) ²≦² with probability 1−δ. Consider a fixed k-dimensional subspace V. If m=Θ(k²3^(q)/(²δ)), then with probability at least 1−δ, ∥Πx∥=(1±)∥x∥ simultaneously for all xεV. Let B be an d^(q)×k matrix whose columns form an orthonormal basis of V. Thus, one has B^(T)B=I_(k) and ∥B∥_(F) ²=k. The condition that ∥Πx∥=(1±)∥x∥ simultaneously for all xεV is equivalent to the condition that the singular values of ΠB are bounded by 1±. For m=Θ(3^(q)/((/k)²δ)), with probability at least 1−δ, one has

∥B ^(T)Π^(T) ΠB−B ^(T) B∥ _(F) ²≦(/k)² ∥B∥ _(F) ⁴=²

Thus, ∥B^(T)Π^(T)ΠB−I_(k)∥≦∥B^(T)Π^(T)ΠB−I_(k)∥_(F)≦. In other words, the squared singular values of ΠB are bounded by 1±, implying that the singular values of ΠB are also bounded by 1±.

The Support Vector Machines (SVM) problem is to find a hyperplane that maximizes the 1-norm soft margin of a set of n input points A₁, . . . , A_(n)εR^(1×d) with respective labels y_(i)ε{−1, 1}. In one example embodiment, first it is stated what is known for the non-kernelized version of the problem, and then the results for the polynomial kernel version of the problem are stated. In one embodiment, the sketching module 340 assumes d>>n. In the non-kernelized version of the problem, if Aε

^(n×d) is the matrix whose rows are the A₁, Yε

^(n×n) is the diagonal matrix with entries Y_(i,i)=y_(i), and αε

^(n) is a vector of Lagrange multipliers, then the dual formulation of the SVM optimization problem is to find a given by the following optimization problem:

${{{{\max\limits_{\alpha}{1^{T}\alpha}} - {\frac{1}{2}\alpha^{T}{YAA}^{T}Y\; \alpha \; {subjectto}\; 1^{T}Y\; \alpha}} = 0};{{{and}\mspace{14mu} 0} \leq \alpha \leq C}},$

where 1^(T), 0, and C are vectors with the implied constant entry.

In one example embodiment, let α* be an optimal solution of the above problem. The optimal separating hyperplane w* is given by w*=A^(T)Yα*=Σ_(i=1) ^(n)y_(i)α_(i)*A_(i), and the points A_(i) for which α_(i)*>0 are called the support vectors. The geometric margin γ* is 1/∥w*∥₂, where it can be shown for this optimal hyperplane that ∥w*∥₂ ²=Σ_(i=1) ^(n)α_(i)*. The data radius is given by r=min_(x)max_(A) _(i) ∥A_(i)−x∥₂. The generalization performance of an SVM is typically characterized by the ratio r/γ*. The smaller the ratio, the better.

Let ε be an accuracy parameter, and suppose S is a subspace embedding for the rowspace of A, that is, for all zε

^(n), ∥zAS∥₂=(1±ε)∥zA∥₂. Suppose one replaces the matrix A in the SVM optimization problem with Ã=AS. Then if {tilde over (γ)}* is the margin obtained using Ã and {tilde over (r)} the radius obtained used Ã, then (1){tilde over (γ)}*≧(1−ε)·γ*, and (2) {tilde over (r)}≦(1+ε)r. One can compute Ã=AS in O((A)) time, and the number of columns of S is (n/ε). It follows that Ã is an n×(n/ε) matrix, so the resulting SVM problem no longer depends on d, and can be solved in (n/ε) time using standard methods. Hence, the overall time is O((A))+(n/ε) and the guarantee is that γ*≧(1−ε)·γ* and {tilde over (r)}≦(1+ε)r.

The difference with the polynomial kernel version of the problem is that A is replaced with φ(A) in the optimization problem, increasing the dimension to d^(q). Note that α is still n-dimensional. Explicitly writing down φ(A) and then computing φ(A)S for a sketching matrix S would take at least d^(q) time, which is intractable. One can improve this to O(n²d log q) time by using the kernel trick to compute φ(A)φ(A)^(T). One could then solve the resulting optimization problem in (n/ε) time. However, if d>>n, this can be quite slow. If A is sparse, then sparsity can potentially be used to compute φ(A)φ(A)^(T) faster, but it is hard to give a formal statement regarding the run-time as it depends on the sparsity structure (e.g., a single dense column in A will cause φ(A)φ(A)^(T) to be full).

In one embodiment, the following is provided for polynomial kernel SVM. In O((A)q)+(3^(q)n/ε) time, one can solve the SVM optimization problem on the polynomial kernel of degree q so as to obtain a hyperplane with margin {tilde over (γ)}*≧(1−ε)·γ* and radius {tilde over (r)}≦(1+ε)r.

Proof S is chosen to be a d^(q)×O(3^(q)n²/ε²) TensorSketch matrix. Using the fact that S provides a subspace embedding, it follows that (1) φ(A)S, and (2) φ(A)S can be computed in O(q((A)+3^(q)n³/ε² log(n/ε))) time and S has O(3^(q)n²/ε²) rows, according to one embodiment.

In one embodiment, the sketching module 340 provides approximate kernel PCA and low rank approximation. In one example, an n×k matrix V with orthonormal columns spans a rank-k (1+)-approximation of an n×d matrix A if ∥A−VV^(T)A∥_(F)≦(¹+ε)∥A−A_(k)∥_(F). One example starts by developing a process for finding an n×k matrix V which spans a rank-k (1+)-approximation of φ(A). In one embodiment, a process of the sketching module 340 is given as follows:

Process 1 k-Space:

1. Input: Aε^(n×d), ε (0,1], integer k.

2. Output: Vε^(n×k) with orthonormal columns which spans a rank-k (1+)-approximation to φ(A).

3. Set the parameters m=O(3^(q)k²+k/) and r=O(3^(q)m²/²).

4. Let S be a d^(q)×m TensorSketch and T be an independent d^(q)×r TensorSketch.

5. Compute φ(A) S and φ(A) T.

6. Let U be an orthonormal basis for the column space of φ(A)·S.

7. Let W be the m×k matrix containing the top k left singular vectors of U^(T)φ(A)T.

8. Output V=UW.

Before proving the correctness of the process 1 above, two key lemmas are provided. Let Sε

^(d) ^(q) ^(×m) be a randomly chosen TensorSketch matrix with m=O(3^(q)k²+k/). Let UU^(T) be the n×n projection matrix onto the column space of φ(A)·S. Then if [U^(T) φ(A)]_(k) is the best rank-k (1+)-approximation to matrix U^(T) φ(A), one has:

∥U[U^(T)φ(A)]_(k)−φ(A)∥_(F)≦(1+O( ))∥φ(A)−[φ(A)]_(k)∥_(F).

Let UU^(T) be as above. Let Tε

^(d) ^(q) ^(×r) be a randomly chosen TensorSketch matrix with r=O(3^(q)m²/²), where m=O(3^(q)k²+k/). Suppose W is the m×k matrix whose columns are the top k left singular vectors of U^(T)φ(A)T. Then,

∥UWW^(T)U^(T)φ(A)−φ(A)∥_(F)≦(1±)∥φ(A)−[φ(A)]_(k)∥_(F).

For the polynomial kernel of degree q, in O((A)q)+n·(3^(q)k/ε) time, the process 1 k-Space finds an n×k matrix V which spans a rank-k (1+)-approximation of φ(A). The output V=UW spans a rank-k (1+)-approximation to φ(A). It only remains to argue the time complexity. The sketches φ(A)·S and φ(A)·T can be computed in O((A)q)+n·(3^(q) k/) time by the sketching module 340. In n·(3^(q)k/) time, the matrix U can be obtained from φ(A)·S and the product U^(T) φ(A)T can be computed. Given U^(T) φ(A)T, the matrix W of top k left singular vectors can be computed in (3^(q)k/) time, and in n·(3^(q)k/) time the product V=UW can be computed. Hence the overall time is O((A)q)+n·(3^(q)k/ε).

In one embodiment, a low rank approximation to φ(A) may be found as follows (Polynomial Kernel PCA and Low Rank Factorization). For the polynomial kernel of degree q, in O((A)q)+(n+d)·(3^(q)k/ε) time, the sketching module 340 can find an n×k matrix V, a k×(k/ε) matrix U, and a (k/)×d matrix R for which

∥V·U·φ(R)−A∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F).

The success probability of the algorithm is at least. 6. Note that it may be implied that the row space of φ(R) contains a k-dimensional subspace L with d^(q)×d^(q) projection matrix LL^(T) for which ∥φ(A)LL^(T)−φ(A)∥_(F)≦(1+ε)∥φ(A)−[φ(A)]_(k)∥_(F), that is, L provides an approximation to the space spanned by the top k principal components of φ(A).

In one embodiment, regularizing learning with the polynomial kernel may be provided by the sketching module 340. Consider learning with the polynomial kernel. If d<<n it might be that even for low values of q one will have d^(q)>>n. This makes a number of learning algorithms underdetermined, and increases the chance of overfitting. The problem is even more severe if the input matrix A has a lot of redundancy in it (noisy features). To address this, many learning algorithms add a regularizer, e.g., ridge terms. In one embodiment, the sketching module 340 regularizes by using using rank-k approximations to the matrix (where k is the regularization parameter that is controlled by the user).

In one embodiment, two different processes that can be regularized using the sketching module 340 are provided. In one example embodiment, approximate kernel PCR is provided. If d^(q)>n the linear regression with φ(A) becomes underdetermined and exact fitting to the right hand side is possible, and in more than one way. One form of regularization is PCR, which first uses PCA to project the data on the principal component, and then continues with linear regression in this space. In one example embodiment, the following approximate version of PCR is provided. In the approximate PCR problem (Approximate PCR), an n×d matrix A and an n×1 vector b is provided, and the goal is to find a vector xε

^(k) and an n×k matrix V with orthonormal columns spanning a rank-k (1+)-approximation to A for which x=argmin_(x)∥Vx−b∥₂. Notice that if A is a rank-k matrix, then Approximate PCR coincides with ordinary least squares regression with respect to the column space of A. While PCR would require solving the regression problem with respect to the top k singular vectors of A, in general finding these k vectors exactly results in unstable computation, and cannot be found by an efficient linear sketch. This would occur, e.g., if the k-th singular value σ_(k) of A is very close (or equal) to σ_(k+1). In one example embodiment, the definition is relaxed to only require that the regression problem be solved with respect to some k vectors which span a rank-k (1+)-approximation to A.

In one embodiment, the following is provided for Approximate PCR. (Polynomial Kernel Approximate PCR.). For the polynomial kernel of degree q, in O((A)q)+n·(3^(q)k/ε) time the sketching module 340 can solve the approximate PCR problem, namely, outputting a vector xε

^(k) and an n×k matrix V with orthonormal columns spanning a rank-k (1+)-approximation to φ(A), for which x=argmin_(x)∥Vx−b∥₂. An n×k matrix V with orthonormal columns spanning a rank-k (1+)-approximation to φ(A) may be found in O((A)q)+n·(3^(q)k/ε) time. At this point, the regression problem argmin_(x)∥Vx−b∥₂ may be determined by the regression process 350 exactly in O(nk) time since the minimizer is x=V^(T)b.

In one example embodiment, canonical correlation analysis (CCA) provides two matrices A, B and it is desired to find directions in which the spaces spanned by their columns are correlated. A formal linear algebraic definition is as follows. Let A ε^(m×n) and B ε^(m×l), and assume that p=(A)≧(B)=q. The canonical correlations σ₁(A, B)≧σ₂ (A, B)≧ . . . ≧σ_(q) (A, B) of the matrix pair (A, B) are defined recursively by the following formula:

${{\sigma_{i}\left( {A,B} \right)} = {{\max\limits_{{x \in A_{i}},{y \in B_{i}}}{{\sigma \left( {{Ax},{By}} \right)}\mspace{14mu} \text{=}}}:{\sigma \left( {{Ax}_{i},{By}_{i}} \right)}}},{i = 1},\ldots \mspace{11mu},q$

where

-   -   σ(u, v)=|u^(T)v|/(∥u∥₂∥v∥₂),     -   A_(i)={x: Ax≠0, Ax⊥{Ax₁, . . . , Ax_(i−1)}},     -   B_(i)={y: By≠0, By⊥{By₁, . . . , By_(i−1)}}.         The unit vectors Ax₁/∥Ax₁∥₂, . . . , Ax_(q)/∥Ax_(q)∥₂,         By₁/∥By₁|₂, . . . , By_(q)/∥By_(q)∥₂ are referred to as the         canonical or principal vectors. The vectors x₁/∥Ax₁∥₂, . . . ,         x_(q)/∥Ax_(q)∥₂, y₁/∥By₁|₂, . . . , y_(q)/∥By_(q)|₂ are referred         to as canonical weights (or projection vectors). Note that the         canonical weights and the canonical vectors are not uniquely         defined.

CCA finds correlations between any parts of the spectrum. Potentially, random correlation between noise might be found, skewing the results. The problem is aggravated in the kernel setting (where correlations are found between φ(A) and φ(B)): if d^(q)>n then there is exact correlation between the subspaces. One common way to address this is to add ridge terms. In one embodiment, the sketching module 340 considers a different form of regularization: finding the correlations between the dominant subspaces of A and B (their principal components). In one example embodiment, an approximate version is introduced. In the Approximate Principal Component CCA Problem (Approximate PC-CCA), given an n×d₁ matrix A and an n×d₂ matrix B, the goal is to find two n×k orthonormal matrices U and V, where U spans a rank-k approximation to A and V spans a rank-k approximation to B, and output the CCA between U and V.

In one embodiment, for the polynomial kernel of degree q, in O(((A)+(B))q)+n·(3^(q)k/ε) time one embodiment can solve the approximate PC-CCA problem on φ(A) and φ(B). In one example, an n×k matrix V with orthonormal columns spanning a rank-k (1+)-approximation to φ(A) can be found in O((A) q)+n·(3^(q) k/ε) time, a n×k matrix U with orthonormal columns spanning a rank-k (1+)-approximation to φ(B) in O((B)q)+n·(3^(q)k/ε) time. At this point, in one embodiment, the sketching module 340 computes the CCA between U and V, which amounts to computing an SVD on U^(T)V. This takes O(nk²) time.

Two sets of example embodiments using the sketching module 340 have the goal of demonstrating that the k-Space process 1 is useful as a feature extraction process. In one example, standard classification and regression datasets. In the first set of example embodiments, ordinary l₂ regression is compared to approximate principal component l₂ regression, where the approximate principal components are extracted using k-Space (RLSC is used for classification). Specifically, as explained above, one embodiment uses k-Space to compute V and then use regression on V (in one dataset we also add an additional ridge regularization) using the regression process 350. To predict, it is noticed that V=φ(A)·S·R⁻¹ W, where R is the R factor of φ(A)·S, so S·R⁻¹·W defines a mapping to the approximate principal components. To predict on a matrix A_(t) the sketching module first computes φ(A_(t))·S·R⁻¹·W (using TensorSketch to compute φ(A_(t))·S fast) and then multiply by the coefficients found by the regression by the regression process 350. In all the example embodiments, φ(•) is defined using the kernel k (u, v)=(u^(T)v+1)³.

k-Space can be rather expensive both in time and in memory to compute, so in one example embodiment, feature extraction using only a subset of the training set is implemented. Specifically, first the dataset is sampled and stored in the storage module 310, then k-Space is used to compute the mapping S·R⁻¹·W. This mapping is applied to the entire dataset before performing regression by the regression process 350. The results of the first example embodiment is shown in Table 1. Since k-Space is randomized, the mean and standard deviation of five (5) runs are shown in Table 1. For all datasets, learning with the extracted features yields better generalized errors than learning with the original features. Extracting the features using only a sample of the training set results in only slightly worse generalization errors.

A similar setup is used in the second set of example embodiments, now using linear SVM instead of regression (run only on the classification datasets). The results are shown in Table 2. Although the gap is smaller, it may be seen again that generally the extracted features lead to better generalization errors.

For Table 1 below: comparison of testing error with using regression with original features and with features extracted using k-Space. In Table 1, n is number of training instances, d is the number of features per instance and n_(t) is the number of instances in the test set. “Regression” stands for ordinary l₂ regression. “PCA Regression” stand for approximate principal component l₂ regression. “Sample PCA Regression” stands approximate principal component l₂ regression where only n_(s) samples from the training set are used for computing the feature extraction. In “PCA Regression” and “Sample PCA Regression” k features are extracted. In k-Space process 1, m=O(k) and r=O(k) are used with the ratio between m and k and r and k as detailed in Table 1. For classification tasks, the percent of testing points incorrectly predicted is reported. For regression tasks, ∥y_(p)−y∥₂/∥y∥ where y_(p) is the predicted values and y is the ground truth are shown.

TABLE 1 Sampled PCA PCA Dataset Regression Regression Regression mnist   14% Out of  7.9% ± 0.06% classification Memory k = 500, n_(s) = 5000 n = 60,000, m/k = 2 d = 784 n_(t) = 10,000 r/k = 4 cpu   12% 4.3% ± 1.0% 3.6% ± 0.1% regression k = 200 k = 200, n_(s) = 2000 n = 6,554, m/k = 4 m/k = 4 d = 21 n_(t) = 819 r/k = 8 r/k = 8 adult 15.3% 15.2% ± 0.1%  15.2% ± 0.03% classification k = 500 k = 500, n_(s) = 5000 n = 32,561, m/k = 2 m/k = 2 d = 123 n_(t) = 16,281 r/k = 4 r/k = 4 census  7.1% 6.5% ± 0.2% 6.8% ± 0.4% regression k = 500 k = 500, n_(s) = 5000 n = 18,186, m/k = 4 m/k = 4 d = 119 n_(t) = 2,273 r/k = 8 r/k = 8 λ = 0.001 λ = 0.001 usps 13.1% 7.0% ± 0.2% 7.5% ± 0.3% classification k = 200 k = 200, n_(s) = 2000 n = 7,291, m/k = 4 m/k = 4 d = 256 n_(t) = 2,007 r/k = 8 r/k = 8

For Table 2: comparison of testing error with using SVM with original features and with features extracted using k-Space. In Table 2, n is number of training instances, d is the number of features per instance and n_(t) is the number of instances in the test set. “SVM” stands for linear SVM. “PCA SVM” stands for using k-Space to extract features, and then using linear SVM. “Sample PCA SVM” stands for using only n_(s) samples from the training set are used for computing the feature extraction. In “PCA SVM” and “Sample PCA SVM” k features are extracted. In k-Space process 1, m=O(k) and r=O(k) with the ratio between m and k and r and k are used as detailed in Table 2. For classification tasks, the percent of testing points incorrectly predicted is reported.

TABLE 2 Sampled Dataset SVM PCA SVM PCA SVM mnist 8.4% Out of 6.1% ± 0.1% classification Memory k = 500, n_(s) = 5000 n = 60,000, m/k = 2 d = 784 n_(t) = 10,000 r/k = 4 adult 15.0% 15.1% ± 0.1%  15.2% ± 0.1%  classification k = 500 k = 500, n_(s) = 5000 n = 32,561, m/k = 2 m/k = 2 d = 123 n_(t) = 16,281 r/k = 4 r/k = 4 usps 8.3% 7.2% ± 0.2% 7.5% ± 0.3% classification k = 200 k = 200, n_(s) = 2000 n = 7,291, m/k = 4 m/k = 4 d = 256 n_(t) = 2,007 r/k = 8 r/k = 8

In one embodiment, it should be noted that it is not a goal to show that the k-Space process 1 is the best feature extraction algorithm of the classification algorithms considered (RLSC and SVM), but rather that it can be used to extract features of higher quality than the original one. In fact, in the example embodiments, while for a fixed number of extracted features, k-Space process 1 produces better features than simply using TensorSketch, it is also more expensive in terms of time. If that additional time is used to do learning or prediction with TensorSketch with more features, an overall better generalization error may be obtained. However, feature extraction is widely applicable, and there may be cases where having fewer high quality features is beneficial, e.g., performing multiple learning on the same data, or a very expensive learning tasks.

FIG. 4 is a block diagram showing a process 400 for sketching for M-estimators for performing regression, in accordance with an embodiment of the invention. In one embodiment, in block 410, one or more sets of input data (e.g., samples/sampling, collected data, etc.) is provided. In one embodiment, in block 420 a matrix A and a vector b are generated using the one or more sets of input data (e.g., using the matrix processing module 320 and vector processing module 330, FIG. 3). In one embodiment, in block 430 a processor device is used for processing the matrix A and the vector b based on a randomized sketching matrix S (e.g., as described in relation to FIG. 3). In one embodiment, in block 440 a vector x is provided that minimizes a normalized measure function based on the matrix A and the vector b. In one embodiment, in block 450 a relationship between the one or more sets of input data is determined based on the vector x.

In one embodiment, process 400 may include determining the vector x based on determining the min_(x)∥Ax−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i)G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix. In one embodiment, the sketching matrix S is used to estimate ∥Ax∥_(G), for all xε

^(d). In one embodiment, the sketching matrix S is used for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, wherein the processor device makes one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, wherein nnz comprises a number of non-zero entries of the matrix A.

In one embodiment, the process 400 may include providing an OSE for embedding a space induced by a nonlinear (polynomial) kernel for the matrix A. In one embodiment, embedding the space comprises applying a map φ(A) to rows of A, wherein the map φ(A) corresponds to the nonlinear kernel. In one embodiment, the process 400 may include determining the vector x based on determining min_(x)∥φ(A)x−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i)G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix

FIG. 5 is a block diagram showing a process 500 for sketching using an OSE for embedding a space induced by a nonlinear kernel, in accordance with an embodiment of the invention. In one embodiment, in block 510, one or more sets of input data (e.g., samples/sampling, collected data, etc.) is provided. In one embodiment, in block 520 a matrix A and a vector b are generated using the one or more sets of input data (e.g., using the matrix processing module 320 and vector processing module 330, FIG. 3). In one embodiment, in block 530, an OSE is provided for applying a map φ(A) to rows of the matrix A, where the map φ corresponds to a nonlinear kernel. In one embodiment, in block 540 a processor device is used for processing map φ(A) and the vector b based on a randomized sketching matrix S (e.g., as described in relation to FIG. 3). In one embodiment, in block 550 a vector x is provided that minimizes a normalized measure function based on the map φ(A) and the vector b. In one embodiment, in block 560 a relationship between the one or more sets of input data is determined based on the vector x.

In one embodiment, in process 500 determining the vector x comprises determining the min_(x)∥φ(A)x−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix. In one embodiment, in process 500 the sketching matrix S is used to estimate ∥φ(A)x∥_(G), for all xε

^(d). In one embodiment, in process 500 the map φ(A) maps each row of A to a higher dimensional space using a polynomial kernel of a given degree.

In one embodiment, process 500 may further include computing an implicit low rank decomposition of φ(A), and performing PCR with respect to an approximation of top k left singular vectors of φ(A), where k is a positive integer.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

References in the claims to an element in the singular is not intended to mean “one and only” unless explicitly so stated, but rather “one or more.” All structural and functional equivalents to the elements of the above-described exemplary embodiment that are currently known or later come to be known to those of ordinary skill in the art are intended to be encompassed by the present claims. No claim element herein is to be construed under the provisions of 35 U.S.C. section 112, sixth paragraph, unless the element is expressly recited using the phrase “means for” or “step for.”

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated. 

What is claimed is:
 1. A method comprising: providing one or more sets of input data; generating a matrix A and a vector b using the one or more sets of input data; processing, using a processor device, the matrix A and the vector b based on a randomized sketching matrix S; determining a vector x that minimizes a normalized measure function based on the matrix A and the vector b; and determining a relationship between the one or more sets of input data based on the vector x.
 2. The method of claim 1, wherein said determining the vector x comprises determining the min_(x)∥Ax−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
 3. The method of claim 2, wherein the sketching matrix S is used to estimate ∥Ax∥_(G), for all xε

^(d).
 4. The method of claim 3, wherein the sketching matrix S is used for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, wherein the processor device makes one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, wherein nnz comprises a number of non-zero entries of the matrix A.
 5. The method of claim 1, further comprising: providing an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear polynomial kernel for the matrix A.
 6. The method of claim 5, wherein embedding the space comprises applying a map φ(A) to rows of A, wherein the map φ(A) corresponds to the nonlinear polynomial kernel.
 7. The method of claim 6, wherein determining the vector x comprises determining min_(x)∥φ(A)x−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
 8. A computer program product for determining relationships between data, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code readable/executable by a processor to: receive one or more sets of input data; generate a matrix A and a vector b based on the one or more sets of input data; process the matrix A and the vector b based on a randomized sketching matrix S; determine a vector x that minimizes a normalized measure function based on the matrix A and the vector b; and determine a relationship between the one or more sets of input data based on the vector x.
 9. The computer program product of claim 8, wherein said determining the vector x comprises determining the min_(x)∥Ax−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
 10. The computer program product of claim 9, wherein the sketching matrix S is used to estimate ∥Ax∥_(G), for all xε

^(d).
 11. The computer program product of claim 10, wherein the sketching matrix S is used for a plurality of M-estimators requiring O(nnz(A)+poly(d log n)) time, wherein the processor device makes one pass over the matrix A for determining a regression solution with a residual error within a constant factor of a specified tolerance and with constant probability, wherein nnz comprises a number of non-zero entries of the matrix A.
 12. The computer program product of claim 8, wherein the program code readable/executable by the processor further to: provide an oblivious subspace embedding (OSE) for embedding a space induced by a nonlinear polynomial kernel for the matrix A.
 13. The computer program product of claim 12, wherein embedding the space comprises applying a map φ(A) to rows of A, wherein the map φ(A) corresponds to the nonlinear polynomial kernel.
 14. The computer program product of claim 8, wherein determine the vector x comprises determine min_(x)∥φ(A)x−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
 15. A method comprising: providing one or more sets of input data; generating a matrix A and a vector b using the one or more sets of input data; providing an oblivious subspace embedding (OSE) for applying a map φ(A) to rows of the matrix A, wherein the map φ corresponds to a nonlinear kernel; processing, using a processor device, the map φ(A) and the vector b based on a randomized sketching matrix S; determining a vector x that minimizes a normalized measure function based on the map φ(A) and the vector b; and determining a relationship between the one or more sets of input data based on the vector x.
 16. The method of claim 15, wherein said determining the vector x comprises determining the min_(x)∥φ(A)x−b∥_(G), where Aε

^(n×d) and bε

^(n), and ∥y∥_(G) for yε

^(n) is specified by the measure function G:

⁺, wherein ∥y∥_(G)≡Σ_(i) G(y_(i)), where i, n and d are positive integers, and the matrix A comprises an n×d matrix.
 17. The method of claim 16, wherein the sketching matrix S is used to estimate ∥φ(A)x∥_(G), for all xε

^(d).
 18. The method of claim 17, wherein the map φ(A) maps each row of A to a higher dimensional space using a polynomial kernel of a given degree.
 19. The method of claim 18, further comprising: computing an implicit low rank decomposition of φ(A).
 20. The method of claim 19, further comprising: performing principal component regression (PCR) with respect to an approximation of top k left singular vectors of φ(A), wherein k is a positive integer. 